[57] ABSTRACT

A vector excitation coder compresses vectors by using an optimum
codebook designed off line, using an initial arbitrary codebook and a
set of speech training vectors, exploiting codevector sparsity (i.e.,
by zeroing the lowest-amplitude samples so that only a selected number
of samples remain nonzero in each of N codebook vectors). A fast-search
method selects a number N_c of good excitation vectors from the
codebook, where N_c is much smaller than N, and uses only the N_c
vectors in an exhaustive search for the best match between a
perceptually weighted input vector z_n and an estimate ẑ_n derived from
a codebook vector processed through long-term and short-term filters
and a perceptual weighting filter. The zero-input response of these
cascaded filters is calculated and subtracted from an input speech
vector s_n after perceptual weighting to produce a vector r_n. The
codebook search
operation is performed using 

$$\|H c_j\|^2 \approx R_{hh}(0)\,R^{j}_{cc}(0) + 2\sum_{i} R_{hh}(i)\,R^{j}_{cc}(i)$$

by calculating the numerator by a fast inner product and calculating
the denominator by a fast inner product for each codebook vector c_j,
computing the right-hand side of the equation once per frame, and then
cross-multiplying the numerators and denominators to determine if
N2/D2 is less than N1/D1 by determining if N1 D2 > N2 D1. If not, N2
and D2 replace N1 and D1 in registers E_n and E_d,
VECTOR EXCITATION SPEECH OR AUDIO CODER FOR TRANSMISSION OR STORAGE

ORIGIN OF INVENTION

The invention described herein was made in the performance of work
under a NASA contract, and is subject to the provisions of Public Law
96-517 (35 USC 202) under which the inventors were granted a request
to retain title.

BACKGROUND OF THE INVENTION 

This invention relates to a vector excitation coder which efficiently
compresses vectors of digital voice or audio for transmission or for
storage, such as on magnetic tape or disc.

In recent developments of digital transmission of voice, it has become
common practice to sample at 8 kHz and to group the samples into
blocks of samples. Each block is commonly referred to as a “vector”
for a type of coding processing called Vector Excitation Coding (VXC).
It is a powerful new technique for encoding analog speech or audio
into a digital representation. Decoding and reconstruction of the
original analog signal permits quality reproduction of the original
signal.

Briefly, the prior art VXC is based on a new and general source-filter
modeling technique in which the excitation signal for a speech
production model is encoded at very low bit rates using vector
quantization. Various architectures for speech coders which fall into
this class have recently been shown to reproduce speech with very high
perceptual quality.

In a generic VXC coder, a vocal-tract model is used in conjunction
with a set of excitation vectors (codevectors) and a perceptually-based
error criterion to synthesize natural-sounding speech. One example of
such a coder is Code Excited Linear Prediction (CELP), which uses
Gaussian random variables for the codevector components. See M. R.
Schroeder and B. S. Atal, “Code-Excited Linear Prediction (CELP):
High-Quality Speech at Very Low Bit Rates,” Proceedings Int’l.
Conference on Acoustics, Speech, and Signal Processing, Tampa, March,
1985 and M. Copperi and D. Sereno, “CELP Coding for High-Quality
Speech at 8 kbits/s,” Proceedings Int’l. Conference on Acoustics,
Speech, and Signal Processing, Tokyo, April, 1986. CELP achieves very
high reconstructed speech quality, but at the cost of astronomic
computational complexity (around 440 million multiply/add operations
per second for real-time selection of the optimal codevector for each
speech block).

In the present invention, VXC is employed with a sparse vector
excitation to achieve the same high reconstructed speech quality as
comparable schemes, but with significantly less computation. This new
coder is denoted Pulse Vector Excitation Coding (PVXC). A variety of
novel complexity reduction methods have been developed and combined,
reducing optimal codevector selection computation to only 0.55 million
multiply/adds per second, which is well within the capabilities of
present data processors. This important characteristic makes the
hardware implementation of a real-time PVXC coder possible using only
one programmable digital signal processor chip, such as the AT&T
DSP32. Implementation of similar speech coding algorithms using either
programmable processors or high-speed, special-purpose devices is
feasible but very impractical due to the large hardware complexity
required.

Although PVXC of the present invention employs 
some characteristics of multipulse linear predictive cod- 
ing (MPLPC) where excitation pulse amplitudes and 
locations are determined from the input speech, and 
some characteristics of CELP, where Gaussian excita- 
tion vectors are selected from a fixed codebook, there 
are several important differences between them. PVXC 
is distinguished from other excitation coders by the use 
of a precomputed and stored set of pulse-like (sparse) 
codevectors. This form of vocal-tract model excitation 
is used together with an efficient error minimization 
scheme in the Sparse Vector Fast Search (SVFS) and 
Enhanced SVFS complexity reduction methods. Fi- 
nally, PVXC incorporates an excitation codebook 
which has been optimized to minimize the perceptually- 
weighted error between original and reconstructed 
speech waveforms. The optimization procedure is based 
on a centroid derivation. In addition, a complexity re- 
duction scheme called Spectral Classification (SPC) is 
disclosed for excitation coders using a conventional 
codebook (fully-populated codevector components). 

There is currently a high demand for speech coding
techniques which produce high-quality reconstructed
speech at rates around 4.8 kb/s. Such coders are needed
to close the gap which exists between vocoders with an 
“electronic-accent” operating at 2.4 kb/s and newer, 
more sophisticated hybrid techniques which produce 
near toll-quality speech at 9.6 kb/s. 

For real-time implementations, the promise of VXC has been thwarted
somewhat by the associated high computational complexity. Recent
research has shown that the dominant computation (excitation codebook
search) can be reduced to around 40 MFlops without compromising speech
quality. However, this operation count is still too high to implement
a practical real-time version using only a few current-generation DSP
chips. The PVXC coder described herein produces natural-sounding
speech at 4.8 kb/s and requires a total computation of only 1.2
MFlops.

OBJECTS AND SUMMARY OF THE 
INVENTION 

The main object of this invention is to reduce the 
complexity of VXC speech coding techniques without 
sacrificing the perceptual quality of the reconstructed 
speech signal in the ways just mentioned. 

A further object is to provide techniques for real-time 
vector excitation coding of speech at a rate below the 
midrate between 2.4 kb/s and 9.6 kb/s. 

In the present invention, a fully-quantized PVXC 
produces natural-sounding speech at a rate well below 
the midrate between 2.4 kb/s and 9.6 kb/s. Near toll- 
quality reconstructed speech is achieved at these low 
rates primarily by exploiting codevector sparsity, by 
reformulating the search procedure in a mathematically 
less complex (but essentially equivalent) manner, and by 
precomputing intermediate quantities which are used 
for multiple input vectors in one speech frame. The 
coder incorporates a pulse excitation codebook which is 
designed using a novel perceptually-based clustering 
algorithm. Speech or audio samples are converted to digital form,
partitioned into frames of L samples, and further partitioned into
groups of k samples to form vectors with a dimension of k samples. The
input vector s_n is preprocessed to generate a perceptually weighted
vector z_n, which is then subtracted from each member of a set of N
weighted synthetic speech vectors {ẑ_j}, j ∈ {1, ..., N}, where N is
the number of excitation vectors in the codebook. The set {ẑ_j} is
generated by filtering pulse excitation (PE) codevectors c_j with two
time-varying, cascaded LPC synthesis filters H_l(z) and H_s(z). In
synthesizing {ẑ_j}, each PE codevector is scaled by a variable gain
G_j (determined by minimizing the mean-squared error between the
weighted synthetic speech signal ẑ_j and the weighted input speech
vector z_n), filtered with cascaded long-term and short-term LPC
synthesis filters, and then weighted by a perceptual weighting filter.
The reason for perceptually weighting the input vector z_n and the
synthetic speech vector with the same weighting filter is to shape the
spectrum of the error signal so that it is similar to the spectrum of
s_n, thereby masking distortion which would otherwise be perceived by
the human ear.

In the paragraph above, and in all the text that follows, a tilde (~)
over a letter signifies the incorporation of a perceptual weighting
factor, and a circumflex (^) signifies an estimate.

An exhaustive search over N vectors is performed for every input
vector s_n to determine the excitation vector c_j which minimizes the
squared Euclidean distortion ||e_j||² between z_n and ẑ_j. Once the
optimal c_j is selected, a codebook index which identifies it is
transmitted to the decoder together with its associated gain. The
parameters of H_l(z) and H_s(z) are transmitted as side information
once per input speech frame (after every (L/k)th s_n vector).

A very useful linear systems representation of the synthesis filters
H_s(z) and H_l(z) is employed. Codebook search complexity is reduced
by removing the effect of the deterministic component of speech
(produced by synthesis filter memory from the previous vector, i.e.,
the zero-input response) on the selection of the optimal codevector
for the current input vector s_n. This is performed in the encoder
only by first finding the zero-input response of the cascaded
synthesis and weighting filters. The difference z_n between a weighted
input speech vector r_n and this zero-input response is the input
vector to the codebook search. The vector r_n is produced by filtering
s_n with W(z), the perceptual weighting filter. With the effect of the
deterministic component removed, the initial memory values in H_s(z)
and H_l(z) can be set to zero when synthesizing {ẑ_j} without
affecting the choice of the optimal codevector. Once the optimal
codevector is determined, filter memory from the previous encoded
vector can be updated for use in encoding the subsequent vector. Not
only does this filter representation allow further reduction in the
computation necessary by efficiently expressing the speech synthesis
operation as a matrix-vector product, but it also leads to a centroid
calculation for use in optimal codebook design routines.
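
The separation of the zero-input response relies on the linearity of the synthesis filters: the full filter output equals the response to the stored memory alone plus the response to the excitation with zero memory. The following is a minimal Python sketch of that property for a single all-pole filter. The function name `all_pole_filter`, the coefficient values, and the memory values are illustrative assumptions and do not come from the patent.

```python
def all_pole_filter(x, a, memory):
    """Filter x through 1/P(z), where P(z) = 1 + sum_i a[i-1] * z^-i.

    `memory` holds the most recent past outputs, newest first, and must
    have the same length as `a`. Returns (outputs, updated_memory).
    """
    past = list(memory)
    out = []
    for sample in x:
        y = sample - sum(ai * yi for ai, yi in zip(a, past))
        out.append(y)
        past = [y] + past[:-1]   # shift the new output into the memory
    return out, past


a = [-0.9, 0.2]                  # hypothetical predictor coefficients
memory = [0.5, -0.3]             # state left behind by the previous vector
excitation = [1.0, 0.0, -0.5, 0.0]
k = len(excitation)

# Full response: real memory and real excitation.
full, _ = all_pole_filter(excitation, a, memory)
# Zero-input response: real memory, all-zero excitation.
zero_input, _ = all_pole_filter([0.0] * k, a, memory)
# Zero-state response: zeroed memory, real excitation.
zero_state, _ = all_pole_filter(excitation, a, [0.0] * len(a))

# By linearity, full = zero_input + zero_state (up to rounding).
residual = [f - zi - zs for f, zi, zs in zip(full, zero_input, zero_state)]
```

Because the zero-input part does not depend on the codevector, it can be subtracted from the weighted input once per vector, and every candidate can then be synthesized from zero memory.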

The novel features that are considered characteristic of this
invention are set forth with particularity in the appended claims. The
invention will best be understood from the following description when
read in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a VXC speech encoder embodying some of
the improvements of this invention.

FIG. 1a is a graph of segmental SNR (SNR_seg) and overall codebook
search complexity versus number of pulses per vector, N_p.

FIG. 1b is a graph of segmental SNR (SNR_seg) and overall codebook
search complexity versus number of good candidate vectors, N_c, in the
two-step fast-search operation of FIG. 4a and FIG. 4b.

FIG. 2 is a block diagram of a PVXC speech encoder 
embodying the present invention. 

FIG. 3 illustrates in a functional block diagram the 
codebook search operation for the system of FIG. 2 
suitable for implementation using programmable signal 
processors. 

FIG. 4a is a functional block diagram which illus- 
trates Spectral Classification, a two-step fast-search 
operation. 

FIG. 4b is a block diagram which expands a func- 
tional block 40 in FIG. 4a. 

FIG. 5 is a schematic diagram disclosing a preferred 
embodiment of the architecture for the PVXC speech 
encoder of FIG. 2. 

FIG. 6 is a flow chart for the preparation and use of 
an excitation codebook in the PVXC speech encoder of 
FIG. 2. 

DESCRIPTION OF PREFERRED 
EMBODIMENTS 

Before describing preferred embodiments of PVXC, the present
invention, a VXC structure will first be described with reference to
FIG. 1 to introduce some inventive concepts and show that they can be
incorporated in any VXC-type system. The original speech signal s_n is
a vector with a dimension of k samples. This vector is weighted by a
time-varying perceptual weighting filter 10 to produce z_n, which is
then subtracted from each member of a set of N weighted synthetic
speech vectors {ẑ_j}, j ∈ {1, ..., N}, in an adder 11. The set {ẑ_j}
is generated by filtering excitation codevectors c_j (originating in a
codebook 12) with a cascaded long-term synthesizer (synthesis filter)
13, a short-term synthesizer (synthesis filter) 14a, and a perceptual
weighting filter 14b. Each codevector c_j is scaled in an amplifier 15
by a gain factor G_j (computed in a block 16) which is determined by
minimizing the mean-squared error e_j between ẑ_j and the perceptually
weighted speech vector z_n. In an exhaustive-search VXC coder of this
type, an excitation vector c_j is selected in block 15a which
minimizes the squared Euclidean error ||e_j||² resulting from a
comparison of vectors z_n and every member of the set {ẑ_j}. An index
I_n having log2 N bits which identifies the optimal c_j is transmitted
for each input vector s_n, along with G_j and the synthesis filter
parameters {a_i}, {b_i}, and P associated with the current input
frame.
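
The exhaustive search with a per-codevector gain can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the gain is left unquantized, the candidates are taken to be already filtered at unit gain, and the names `dot` and `best_codevector` are hypothetical. For a unit-gain synthetic vector y, the gain that minimizes the squared error against z is ⟨z, y⟩/⟨y, y⟩, and the resulting error is ⟨z, z⟩ minus the gain times ⟨z, y⟩.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def best_codevector(z, synthesized):
    """Return (index, gain, squared_error) of the best gain-scaled match to z.

    `synthesized` holds one unit-gain synthetic vector per codebook entry.
    """
    best = None
    for j, y in enumerate(synthesized):
        energy = dot(y, y)
        if energy == 0.0:
            continue                       # an all-zero vector cannot match
        gain = dot(z, y) / energy          # least-squares optimal gain
        err = dot(z, z) - gain * dot(z, y)
        if best is None or err < best[2]:
            best = (j, gain, err)
    return best


z = [1.0, 2.0, 0.0]                        # weighted target vector
candidates = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
index, gain, err = best_codevector(z, candidates)
```

In this toy case the second candidate is an exact scaled copy of the target, so it is selected with a gain of 2 and zero error.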

The transfer functions W(z), H_l(z), and H_s(z) of the time-varying
recursive filters 10, 13 and 14a, b are given by

$$W(z) = \frac{P(z)}{P(z/\gamma)} \tag{1a}$$

$$H_l(z) = \frac{1}{B(z)} \tag{1b}$$

$$H_s(z) = \frac{1}{P(z)} \tag{1c}$$

where

$$P(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}, \qquad B(z) = 1 + \sum_{i=-J}^{J} b_i z^{-P-i},$$
the a_i are predictor coefficients obtained by a suitable LPC (linear
predictive coding) analysis method of order p, the b_i are predictor
coefficients of a long-term LPC analysis of order q = 2J+1, and the
integer lag term P can roughly be described as the sample delay
corresponding to one pitch period. The parameter γ (0 ≤ γ ≤ 1)
determines the amount of perceptual weighting applied to the error
signal. The parameters {a_i} are determined by a short-term LPC
analysis 17 of a block of vectors, such as a frame of four vectors,
each vector comprising 40 samples. The block of vectors is stored in
an input buffer (not shown) during this analysis, and then processed
to encode the vectors by selecting the best match between a
preprocessed input vector z_n and a synthetic vector ẑ_j, and
transmitting only the index of the optimal excitation c_j. After
computing a set of parameters {a_i} (e.g., twelve of them), inverse
filtering of the input vector s_n is performed using a short-term
inverse filter 18 to produce a residual vector d_n. The inverse filter
has a transfer function equal to P(z). Pitch predictive analysis
(long-term LPC analysis) 19 is then performed using the vector d_n,
where d_n represents a succession of residual vectors corresponding to
every vector s_n of the block or frame.
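
As an illustration of the inverse filtering step, the sketch below applies the FIR filter P(z) to a short signal to obtain a residual, then passes that residual through the all-pole filter 1/P(z) to confirm that the original samples are recovered. The coefficient values, the test signal, and the function names are assumptions chosen for the example, and both filters start from zero state.

```python
def inverse_filter(s, a):
    """FIR inverse filter P(z): d(n) = s(n) + sum_i a[i-1] * s(n-i)."""
    d = []
    for n in range(len(s)):
        acc = s[n]
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * s[n - i]
        d.append(acc)
    return d


def synthesis_filter(d, a):
    """All-pole filter 1/P(z): s(n) = d(n) - sum_i a[i-1] * s(n-i)."""
    s = []
    for n in range(len(d)):
        acc = d[n]
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc -= ai * s[n - i]
        s.append(acc)
    return s


a = [-1.2, 0.5]                                # hypothetical a_i values
speech = [0.0, 1.0, 0.8, -0.3, -0.9, 0.1]      # hypothetical input samples
residual = inverse_filter(speech, a)           # short-term residual d_n
rebuilt = synthesis_filter(residual, a)        # should reproduce `speech`
```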

The perceptual weighting filter W(z) has been moved from its
conventional location at the output of the error subtraction operation
(adder 11) to both of its input branches. In this case, s_n will be
weighted once by W(z) (prior to the start of an excitation codebook
search). In the second branch, the weighting function W(z) is
incorporated into the short-term synthesizer channel now labeled
short-term weighted synthesizer 14. This configuration is
mathematically equivalent to the conventional design, but requires
less computation. A desirable effect of moving W(z) is that its zeros
exactly cancel the poles of the conventional short-term synthesizer
14a (LPC filter) 1/P(z), producing the pth order weighted synthesis
filter

$$W(z)\,H_s(z) = \frac{1}{P(z/\gamma)}.$$

This arrangement requires a factor of 3 fewer computations per
codevector than the conventional approach, since only k(p+q)
multiply/adds are required for filtering a codevector instead of
k(3p+q) when W(z) weights the error signal directly. The structure of
FIG. 1 is otherwise the same as conventional prior art VXC coders.
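
The pole-zero cancellation can be checked numerically. The sketch below, which uses assumed coefficient values and hypothetical helper names, compares the impulse response of the conventional cascade (1/P(z) followed by W(z) = P(z)/P(z/γ)) with that of the single filter 1/P(z/γ). The coefficients of P(z/γ) follow from substituting z/γ for z in P(z), which multiplies each a_i by γ^i.

```python
def fir(x, b):
    """y(n) = x(n) + sum_i b[i-1] * x(n-i), zero initial state."""
    return [x[n] + sum(bi * x[n - i] for i, bi in enumerate(b, 1) if n - i >= 0)
            for n in range(len(x))]


def all_pole(x, a):
    """y(n) = x(n) - sum_i a[i-1] * y(n-i), zero initial state."""
    y = []
    for n in range(len(x)):
        y.append(x[n] - sum(ai * y[n - i] for i, ai in enumerate(a, 1) if n - i >= 0))
    return y


a = [-1.2, 0.5]          # hypothetical short-term predictor coefficients
gamma = 0.8              # hypothetical weighting parameter, 0 <= gamma <= 1
a_gamma = [ai * gamma ** i for i, ai in enumerate(a, 1)]   # P(z/gamma)

impulse = [1.0] + [0.0] * 19

# Conventional order: synthesize with 1/P(z), then weight with P(z)/P(z/gamma).
conventional = all_pole(fir(all_pole(impulse, a), a), a_gamma)
# Combined weighted synthesis filter 1/P(z/gamma).
combined = all_pole(impulse, a_gamma)

max_diff = max(abs(u - v) for u, v in zip(conventional, combined))
```

The two impulse responses agree to within rounding, while the combined form applies one pth-order recursion instead of three pth-order stages.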

Computation can be further reduced by removing the effect of the
memory in the filters 13 and 14 (having the transfer functions H_l(z)
and H_s(z)) on the selection of an optimal excitation for the current
vector of input speech. This is accomplished using a very
low-complexity technique to preprocess the weighted input speech
vector once prior to the subsequent codebook search, as described in
the last section. The result of this procedure is that the initial
memory in these filters can be set to zero when synthesizing {ẑ_j}
without affecting the choice of the optimal codevector. Once the
optimal codevector is determined, filter memory from the previous
vector can be updated for encoding the subsequent vector. This
approach also allows the speech synthesis operation to be efficiently
expressed as a matrix-vector product, as will now be described.

For this method, called Sparse Vector Fast Search (SVFS), a new
formulation of the LPC synthesis and weighting filters 13 and 14 is
required. The following shows how a suitable algebraic manipulation
and an appropriate but modest constraint on the Gaussian-like
codevectors lead to an overall reduction in codebook search complexity
by a factor of approximately ten. The complexity reduction factor can
be increased by varying a parameter of the codebook construction
process. The result is that the performance versus complexity
characteristic exhibits a threshold effect that allows a substantial
complexity saving before any perceptual degradation in quality is
incurred. A side benefit of this technique is that memory storage for
the excitation vectors is reduced by a factor of seven or more.
Furthermore, codebook search computation is virtually independent of
LPC filter order, making the use of high-order synthesis filters more
attractive.

It was noted above that memory terms in the infinite impulse response
filters H_l(z) and H_s(z) can be set to zero prior to synthesizing
{ẑ_j}. This implies that the output of the filters 13 and 14 can be
expressed as a convolution of two finite sequences of length k, scaled
by a gain:

$$\hat{z}_j(m) = G_j \bigl(h(m) * c_j(m)\bigr), \tag{2}$$

where ẑ_j(m) is a sequence of weighted synthetic speech samples, h(m)
is the impulse response of the combined short-term, long-term, and
weighting filters, and c_j(m) is a sequence of samples for the jth
excitation vector.

A matrix representation of the convolution in equation (2) may be
given as:

$$\hat{z}_j = G_j H c_j, \tag{3}$$

where H is a k by k lower triangular matrix whose elements are from
h(m):

$$H = \begin{bmatrix}
h(0) & 0 & 0 & \cdots & 0 \\
h(1) & h(0) & 0 & \cdots & 0 \\
h(2) & h(1) & h(0) & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
h(k-1) & h(k-2) & h(k-3) & \cdots & h(0)
\end{bmatrix} \tag{4}$$
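
The relationship between the truncated convolution of equation (2) and the lower triangular matrix H can be sketched in a few lines. The impulse response and excitation values below are hypothetical, as are the function names; the gain G_j is omitted because it only scales the result.

```python
def build_h_matrix(h, k):
    """k-by-k lower triangular Toeplitz matrix with H[i][m] = h(i - m), i >= m."""
    return [[h[i - m] if i >= m else 0.0 for m in range(k)] for i in range(k)]


def mat_vec(H, c):
    """Plain matrix-vector product H c."""
    return [sum(row[m] * c[m] for m in range(len(c))) for row in H]


def truncated_convolution(h, c):
    """First k samples of h * c, where k = len(c)."""
    k = len(c)
    return [sum(h[i - m] * c[m] for m in range(i + 1)) for i in range(k)]


h = [1.0, 0.6, 0.36, 0.216]     # hypothetical impulse response, first k samples
c = [0.0, 2.0, 0.0, -1.0]       # hypothetical excitation vector
H = build_h_matrix(h, len(c))

by_matrix = mat_vec(H, c)
by_convolution = truncated_convolution(h, c)
```

Both routes give the same k output samples, which is what allows the recursive filters to be replaced by a matrix product once their memory has been zeroed.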


Now the weighted distortion from the jth codevector can be expressed
simply as

$$\|e_j\|^2 = \|z_n - \hat{z}_j\|^2 = \|z_n - G_j H c_j\|^2. \tag{5}$$

In general, the matrix computation to calculate ẑ_j requires k(k+1)/2
operations of multiplication and addition versus k(p+q) for the
conventional linear recursive filter realization. For the chosen set
of filter parameters (k=40, p+q=19), it would be slightly more
expensive for an arbitrary excitation vector c_j to compute ||e_j||²
using the matrix formulation since (k+1)/2 > p+q. However, if each c_j
is suitably chosen to have only N_p pulses per vector (the other
components are zero), then equation (5) can be computed very
efficiently. Typically, N_p/k is 0.1. More specifically, if the
matrix-vector product Hc_j is calculated using:

For m = 0 to k-1
    If c_j(m) = 0, then
        Next m
    otherwise
        For i = m to k-1
            ẑ_j(i) = ẑ_j(i) + c_j(m) h(i-m).
Then the average computation for Hc_j is N_p(k+1)/2 multiply/adds,
which is less than k(p+q) if N_p < 37 (for the k, p, and q given
previously).
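
A runnable version of the sparse product loop is sketched below, with a counter for multiply/adds. The impulse response, pulse positions, and amplitudes are hypothetical values chosen for illustration, and `sparse_synthesis` is not a name from the patent. For comparison, with k=40 and p+q=19 the formulas in the text give k(k+1)/2 = 820 operations for a fully populated vector, k(p+q) = 760 for the recursive filters, and an average of N_p(k+1)/2 = 82 for N_p=4.

```python
def sparse_synthesis(h, c):
    """Compute H c by adding one shifted, scaled copy of h per nonzero pulse.

    Returns (result, multiply_adds).
    """
    k = len(c)
    z = [0.0] * k
    ops = 0
    for m in range(k):
        if c[m] == 0.0:
            continue                      # zero components cost nothing
        for i in range(m, k):
            z[i] += c[m] * h[i - m]
            ops += 1
    return z, ops


k = 40
h = [0.9 ** n for n in range(k)]          # hypothetical impulse response
c = [0.0] * k
for position, amplitude in [(3, 1.5), (11, -0.7), (22, 0.4), (35, -1.1)]:
    c[position] = amplitude               # four pulses, N_p = 4

z, ops = sparse_synthesis(h, c)

# Reference: direct truncated convolution over all k components.
dense = [sum(h[i - m] * c[m] for m in range(i + 1)) for i in range(k)]
```

The exact count depends on where the pulses fall, since a pulse at position m costs k-m multiply/adds; pulses late in the vector are cheaper than early ones.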

A very straightforward pulse codebook construction procedure exists
which uses an initial set of vectors whose components are all nonzero
to construct a set of sparse excitation codevectors. This procedure,
called center-clipping, is described in a later section. The
complexity reduction factor of this SVFS is adjusted by varying N_p, a
parameter of the codebook design process.
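
Since the center-clipping procedure itself is described later, the following is only a rough sketch of the idea under one assumption: that each codevector keeps its N_p largest-magnitude components and has the rest set to zero. The function name `center_clip` and the sample values are hypothetical.

```python
def center_clip(vector, n_pulses):
    """Zero every component except the n_pulses largest in magnitude."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:n_pulses])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


dense = [0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.9, -1.1]   # hypothetical codevector
sparse = center_clip(dense, 4)
```

Raising or lowering `n_pulses` corresponds to varying N_p, which trades search complexity against how closely the sparse vector follows the original.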

Zeroing of selected codevector components is consistent with results
obtained in Multi-Pulse LPC (MPLPC) [B. S. Atal and J. R. Remde, “A
New Model of LPC Excitation for Producing Natural-Sounding Speech at
Low Bit Rates,” Proc. Int’l. Conf. on Acoustics, Speech, and Signal
Processing, Paris, May 1982], since it has been shown that only about
8 pulses are required per pitch period (one pitch period is typically
5 ms for a female speaker) to synthesize natural-sounding speech. See
S. Singhal and B. S. Atal, “Improving Performance of Multi-Pulse LPC
Coders at Low Bit Rates,” Proc. Int’l. Conf. on Acoustics, Speech and
Signal Processing, San Diego, March 1984. Even more encouraging,
simulation results of the present invention indicate that
reconstructed speech quality does not start to deteriorate until the
number of pulses per vector drops to 2 or 3 out of 40. Since, with the
matrix formulation, computation decreases as the number of zero
components increases, significant savings can be realized by using
only 4 pulses per vector. In fact, when N_p=4 and k=40, filtering
complexity reduction by a factor of ten is achieved.

FIG. 1a shows plots of segmental SNR (SNR_seg) and overall codebook
search complexity versus number of pulses per vector, N_p. It is noted
that as N_p decreases, SNR_seg does not start to drop until N_p
reaches 3. In fact, informal listening tests show that the perceptual
quality of the reconstructed speech signal actually improves slightly
as N_p is reduced from 40 to 4 and at the same time, the filtering
computation complexity drops significantly.

It should also be noted that the required amount of codebook memory
can be greatly reduced by storing only N_p pulse amplitudes and their
associated positions instead of k amplitudes (most of which are zero
in this scheme). For example, memory storage reduction by a factor of
7.3 is achieved when k=40, N_p=4, and each codevector component is
represented by a 16-bit word.

The second simplification (improvement), Spectral Classification, also
reduces overall codebook search effort by a factor of approximately
ten. It is based on the premise that it is possible to perform a
precomputation of simple to moderate complexity using the input speech
to eliminate a large percentage of excitation codevectors from
consideration before an exhaustive search is performed.

It has been shown by other researchers that for a given speech frame,
the number of excitation vectors from a codebook of size 1024 which
produce acceptably low distortion is small (approximately 5). The goal
in this fast-search scheme is to use a quick but approximate procedure
to find a number N_c of “good” candidate excitation vectors (N_c < N)
for subsequent use in a reduced exhaustive search of N_c codevectors.
This two-step operation is presented in FIG. 4a.

In Step 1, the input vector z_n is compared with ẑ_j to screen
codevectors in block 40 and produce a set of N_c candidate vectors to
use in a reduced codevector search. Refer to FIG. 4b for an expanded
view of block 40. The N_c surviving codevectors are selected by making
a rough classification of the gain-normalized spectral shape of the
current speech frame into one of M_s classes. One of M_s corresponding
codebooks (selected by the classification operation) is then used in a
simplified speech synthesis procedure to generate ẑ_j. The N_c
excitation vectors producing the lowest distortions are selected in
block 40 for use in Step 2, the reduced exhaustive search using the
scaler 30, long-term synthesizer 26, and short-term weighted
synthesizer 25 (filters 25a and 25b in cascade as before). The only
thing different is a reduced codevector set, such as 30 codevectors
reduced from 1024. This is where computational savings are achieved.

Spectral classification of the current speech frame in block 40 is
performed by quantizing its short-term predictor coefficients using a
vector quantizer 42 shown in FIG. 4b with M_s spectral shape
codevectors (typically M_s = 4 to 8). This classification technique is
very low in complexity (it comprises less than 0.2% of the total
codebook search effort). The vector quantizer output (an index)
selects one of M_s corresponding codebooks to use in the speech
synthesis procedure (one codebook for each spectral class). To
construct each shaped codebook, Gaussian-like codevectors from a pulse
excitation codebook 20 are input to an LPC synthesis filter 25a
representing the codebook’s spectral class. The “shaped” codevectors
are precomputed off-line and stored in the codebooks 1, 2, ..., M_s.
By calculating the short-term filtered excitation off-line, this
computational expense is saved in the encoder. Now the candidate
excitation vectors from the original Gaussian-like codebook can be
selected simply by filtering the shaped vectors from the selected
class codebook with H_l(z), and retaining only those N_c vectors which
produce the lowest weighted distortion. In Step 2 of Spectral
Classification, a final exhaustive search over these N_c vectors (to
determine the optimal one) is conducted using quantized values of the
predictor coefficients determined by LPC analysis of the current
speech frame.

Computer simulation results show that with M_s = 4, N_c can be as low
as 30 with no loss in perceptual quality of the reconstructed speech,
and when N_c = 10, only a very slight degradation is noticeable. FIG.
1b summarizes the results of these simulations by showing how SNR_seg
and overall codebook search complexity change with N_c. Note that the
drop in SNR_seg as N_c is reduced does not occur until after the knee
of the complexity versus N_c curve is passed.

The sparse-vector and spectral classification fast codebook search
techniques for VXC have each been shown to reduce complexity by an
order of magnitude without incurring a loss in subjective quality of
the reconstructed speech signal. In the sparse-vector method, a matrix
formulation of the LPC synthesis filters is presented which possesses
distinct advantages over conventional all-pole recursive filter
structures. In spectral classification, approximately 97% of the
excitation codevectors are eliminated from the codebook search by
using a crude identification of the spectral shape of the current
frame. These two methods can be combined together or with other
compatible fast-search schemes to achieve even greater reduction.

These techniques for reducing the complexity of Vector Excitation
Coding (VXC) discussed above in general will now be described with
reference to a particular embodiment called PVXC utilizing a pulse
excitation (PE) codebook in which codevectors have been designed as
just described with zeroing of selected codevector components to
leave, for example, only four pulses, i.e., nonzero samples, for a
vector of 40 samples. It is this pulse characteristic of PE
codevectors that suggests the name “pulse vector excitation coder,”
referred to as PVXC.
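
The two-step Spectral Classification search described above can be sketched generically as a cheap ranking pass followed by an exact search over the survivors. The sketch below is an illustration under stated assumptions: `rough_vectors` stands in for the output of the simplified synthesis using the class codebook, `exact_vectors` stands in for the output of the full synthesis with the current frame's filters, and all names and numbers are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def gain_scaled_error(z, y):
    """Squared error between z and the best gain-scaled version of y."""
    energy = dot(y, y)
    if energy == 0.0:
        return dot(z, z)
    return dot(z, z) - dot(z, y) ** 2 / energy


def two_step_search(z, rough_vectors, exact_vectors, n_candidates):
    """Step 1 ranks all codevectors by an approximate synthesis;
    Step 2 searches only the n_candidates survivors with the exact one."""
    ranked = sorted(range(len(rough_vectors)),
                    key=lambda j: gain_scaled_error(z, rough_vectors[j]))
    survivors = ranked[:n_candidates]
    return min(survivors, key=lambda j: gain_scaled_error(z, exact_vectors[j]))


z = [1.0, 1.0, 0.0]
exact = [[1.0, 0.9, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.1], [1.0, -1.0, 0.0]]
rough = [[1.0, 0.7, 0.0], [0.0, 0.1, 1.0], [0.9, 1.0, 0.3], [1.0, -0.8, 0.0]]

chosen = two_step_search(z, rough, exact, n_candidates=2)
full_search = min(range(len(exact)), key=lambda j: gain_scaled_error(z, exact[j]))
```

In this toy case the reduced search returns the same index as the full exhaustive search while evaluating the exact synthesis for only two of the four codevectors. With the figures quoted in the text, keeping 30 of 1024 codevectors removes about 97% of the exact evaluations.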

PVXC is a hybrid speech coder which combines an analysis-by-synthesis
approach with conventional waveform compression techniques. The basic
structure of PVXC is presented in FIG. 2. The encoder consists of an
LPC-based speech production model and an error weighting function
W(z). The production model contains two time-varying, cascaded LPC
synthesis filters H_s(z) and H_l(z) describing the vocal tract, a
codebook 20 of N pulse-like excitation vectors c_j, and a gain term
G_j. As before, H_s(z) describes the spectral envelope of the original
speech signal s_n, and H_l(z) is a long-term synthesizer which
reproduces the spectral fine structure (pitch). The transfer functions
of H_s(z) and H_l(z) are given by H_s(z) = 1/P_s(z) and H_l(z) =
1/P_l(z), where

$$P_s(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$$

and

$$P_l(z) = 1 + \sum_{i=-J}^{J} b_i z^{-P-i}.$$

Here, a_i and b_i are the quantized short and long-term predictor
coefficients, respectively, P is the “pitch” term derived from the
short-term LPC residual signal (20 ≤ P ≤ 147), and p and q (= 2J+1)
are the short and long-term predictor orders, respectively. Tenth
order short-term LPC analysis is performed on frames of length L = 160
samples (20 ms for an 8 kHz sampling rate). P_l(z) contains a 3-tap
predictor (J = 1) which is updated once per frame. The weighting
filter has a transfer function W(z) = P_s(z)/P_s(z/γ), where P_s(z)
contains the unquantized predictor parameters and 0 ≤ γ ≤ 1. The
purpose of the perceptual weighting filter W(z) is the same as before.

Referring to FIG. 2, the basic structure of a PVXC system (encoder and
decoder) is shown with the encoder (transmitter) in the upper part
connected to a decoder (receiver) by a channel 21 over which a pulse
excitation (PE) codevector index and gain are transmitted for each
input vector s_n after encoding in accordance with this invention.
Side information, consisting of the parameters Q{a_i}, Q{b_i}, QG_j
and P, is transmitted to the decoder once per frame (every L input
samples). The original speech input samples s, converted to digital
form in an analog-to-digital converter 22, are partitioned into a
frame of L/k vectors, with each vector having a group of k successive
samples. More than one frame is stored in a buffer 23, which thus
stores more than 160 samples at a time, such as 320 samples.

For each frame, an analysis section 24 performs short-term LPC
analysis and long-term LPC analysis to determine the parameters {a_i},
{b_i} and P from the original speech contained in the frame. These
parameters are used in a short-term synthesizer 25a comprised of a
digital filter specified by the parameters {a_i}, and a perceptual
weighting filter 25b, and in a long-term synthesizer 26 comprised of a
digital filter specified by four parameters {b_i} and P. These
parameters are coded using quantizing tables and only their indices
Q{a_i} and Q{b_i} are sent as side information to the decoder which
uses them to specify the filters of long-term and short-term
synthesizers 27 and 28, respectively, in reconstructing the speech.
The channel 21 includes at its encoder output a multiplexer to first
transmit the side information, and then the codevector indices and
gains, i.e., the encoded vectors of a frame, together with a quantized
gain factor QG_j computed for each vector. The channel then includes
at its output a demultiplexer to send the side information to the
long-term and short-term synthesizers in the decoder. The quantized
gain factor QG_j of each vector is sent to a scaler 29 (corresponding
to a scaler 30 in the encoder) with the decoded codevector.

After the LPC analysis has been completed for a frame, the encoder is ready to select an appropriate pulse excitation from the codebook 20 for each of the original speech vectors in the buffer 23. The first step is to retrieve one input vector from the buffer 23 and filter it with the perceptual weighting filter 33. The next step is to find the zero-input response of the cascaded encoder synthesis filters 25a, b, and the long-term synthesizer 26. The computation required is indicated by a block 31 which is labeled “vector response from previous frame”. Knowing the transfer functions of the long-term, short-term and weighting filters, and knowing the memory in these filters, a zero-input response h_n is computed once for each vector and subtracted from the corresponding weighted input vector x_n to produce a residual vector z_n. This effectively removes the residual effects (ringing) caused by filter memory from past inputs. With the effect of the zero-input response removed, the initial memory values in H_l(z) and H_s(z) can be set to zero when synthesizing the set of vectors {ẑ_j} without affecting the choice of the optimal codevector. The pulse excitation codebook 32 in the decoder identically corresponds to the encoder pulse excitation codebook 20. The transmitted indices can then be used to address the decoder PE codebook 32.

The next step in performing a codebook search for each vector within one frame is to take all N PE codevectors in the codebook, and using them as pulse excitation vectors c_j, pass them one at a time through the scaler 30, long-term synthesizer 26 and short-term weighted synthesizer 25 in cascade, and calculate the vector ẑ_j that results for each of the PE codevectors. This is done N times for each new input vector z_n. Next, the perceptually weighted vector z_n is subtracted from the vector ẑ_j to produce an error e_j. This is done for each of the N PE codevectors of the codebook 20, and the set of errors {e_j} is stored in a block 34 which computes the Euclidean norm. The set {e_j} is stored in the same indexed order as the PE codevectors {c_j} so that when a search is made in a block 35 for the best match, i.e., least distortion, the index of that error e_j which produces the least distortion can be transmitted to the decoder via the channel 21.
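As a rough illustration of the straightforward search just described, the following Python sketch filters each codevector through a single impulse response h with zero initial memory, scales it with the gain that minimizes the squared error for that codevector, and keeps the index of least distortion. The function names, the use of one combined impulse response to stand in for the cascade of synthesizers 26 and 25, and the least-squares gain are assumptions made for illustration; they are not taken from FIG. 2.

```python
def synthesize(c, h):
    """Zero-state filter response to excitation c, truncated to len(c) samples."""
    k = len(c)
    return [sum(h[n - m] * c[m] for m in range(n + 1)) for n in range(k)]


def exhaustive_search(z, codebook, h):
    """Return (index, gain, distortion) of the codevector that best matches z."""
    best = None
    for j, c in enumerate(codebook):
        y = synthesize(c, h)                      # unscaled synthetic vector
        energy = sum(v * v for v in y)
        if energy == 0.0:                         # an all-zero vector cannot match
            continue
        gain = sum(a * b for a, b in zip(z, y)) / energy
        dist = sum((a - gain * b) ** 2 for a, b in zip(z, y))
        if best is None or dist < best[2]:
            best = (j, gain, dist)
    return best
```

This version costs one full filtering operation per codevector per input vector, which is the expense the fast search methods described later are intended to avoid.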

In the receiver, the side information Q{b_j} and Q{a_i} received for each frame of vectors is used to specify the transfer functions H_l(z) and H_s(z) of the long-term and short-term synthesizers 27 and 28 to match the corresponding synthesizers in the transmitter but without perceptual weighting. The gain factor QG_j, which is determined to be optimum for each c_j in the search for the least error index, is transmitted with the index, as noted above. Thus, while QG_j is in essence side information used to control the scaling unit 29 to correspond to the gain of the scaling unit 30 in the transmitter at the time the least error was found, it is not transmitted in a block with the parameters Q{a_i} and Q{b_j}.

The index of a PE codevector c_j is received together with its associated gain factor to extract the identical PE codevector c_j at the decoder for excitation of the synthesizers 27 and 28. In that way an output vector ŝ_n is synthesized which closely matches the vector ẑ_j that best matched z_n (derived from the input vector s_n). The perceptual weighting used in the transmitter, but not the receiver, shapes the spectrum of the error e_j so that it is similar to s_n. An important feature of this invention is to apply the perceptual weighting function to the PE codevector c_j and to the speech vector s_n instead of to the error e_j. By applying the perceptual weighting factor to both of the vectors at the input of the summer used to form the error e_j instead of at the conventional location to the error signal directly, a number of advantages are achieved over the prior art. First, the error computation given in Eq. 5 can be expressed in terms of a matrix-vector product. Second, the zeros of the weighting filter cancel the poles of the conventional short-term synthesizer 25a (LPC filter), producing the p-th order weighted synthesis filter H_s(z) as noted hereinbefore with reference to FIG. 1 and Eq. 1.

That advantage, coupled with the sparse vector coding (i.e., zeroing of selected samples of a codevector), greatly facilitates implementing the codebook search. An exhaustive search is performed for every input vector s_n to determine the excitation vector c_j which minimizes the Euclidean distortion ||e_j||^2 between z_n and ẑ_j as noted hereinbefore. It is therefore important to minimize the number of operations necessary in the best-match search of each excitation vector c_j. Once the optimal (best match) c_j is found, the codebook index of the optimal c_j is transmitted with the associated quantized gain QG_j.

Since the search for the optimal c_j requires the most computation, the Sparse Vector Fast Search (SVFS) technique, discussed hereinbefore, has been developed as the basic PE codevector search for the optimal c_j in PVXC speech or audio coders. An enhanced SVFS method combines the matrix formulation of the synthesis filters given above and a pulse excitation model with ideas proposed by I. M. Trancoso and B. S. Atal, “Efficient Procedures for Finding the Optimum Innovation in Stochastic Coders,” Proceedings Int’l Conference on Acoustics, Speech, and Signal Processing, Tokyo, April 1986, to achieve substantially less computation per codebook search than either method achieves separately. Enhanced SVFS requires only 0.55 million multiply/adds per second in a real-time implementation with a codebook size 256 and vector dimension 40.

In Trancoso and Atal, it is shown that the weighted error minimization procedure associated with the selection of an optimal codevector can be equivalently expressed as a maximization of the following ratio:

$$\frac{(z^T H c_j)^2}{\|H c_j\|^2} = \frac{(v^T c_j)^2}{R_{hh}(0)\,R_{cc}^{j}(0) + 2\sum_{i=1}^{k-1} R_{hh}(i)\,R_{cc}^{j}(i)} \qquad (6)$$

where R_hh(i) and R_cc^j(i) are autocorrelations of the impulse response h(m) and the j-th codevector c_j, respectively. As noted by Trancoso and Atal, G_j no longer appears explicitly in Eq. (6); however, the gain is optimized automatically for each c_j in the search procedure. Once an optimal index is selected, the gain can be calculated from z_n and ẑ_j in block 35a and quantized for transmission with the index in block 21.

In the enhanced SVFS method, the fact is exploited that high reconstructed speech quality is maintained when the codevectors are sparse. In this case, c_j and R_cc^j(i) both contain many zero terms, leading to a significantly simplified method for calculating the numerator and denominator in Eq. (6). Note that the R_cc^j(i) can be precomputed and stored in ROM memory together with the excitation codevectors c_j. Furthermore, the squared Euclidean norms ||Hc_j||^2 only need to be computed once per frame and stored in a RAM memory of size N words. Similarly, the vector v^T = z^T H only needs to be computed once per input vector.
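To make the saving concrete, here is a minimal Python sketch of the numerator of Eq. (6) for a sparse codevector. It assumes H is the k-by-k lower-triangular matrix built from h, so that v = H^T z reduces to v(l) = sum over n of h(n) z(n+l), and it assumes each codevector is held as a list of (amplitude, location) pairs with 0-based locations. Both function names are hypothetical.

```python
def correlate_with_impulse(z, h):
    """v = H^T z for lower-triangular Toeplitz H: v(l) = sum_n h(n) * z(n + l)."""
    k = len(z)
    return [sum(h[n] * z[n + l] for n in range(k - l)) for l in range(k)]


def sparse_dot(pulses, v):
    """Inner product v^T c touching only the nonzero pulses of c.

    pulses is a list of (amplitude, location) pairs; locations are 0-based here.
    """
    return sum(a * v[loc] for a, loc in pulses)
```

With four pulses per codevector, each numerator costs four multiply/adds once v is available, compared with k for a fully populated codevector.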

The codebook search operation for the PVXC of FIG. 2 suitable for implementation using programmable digital signal processor (DSP) chips, such as the AT&T DSP32, is depicted in FIG. 3. Here, the numerator term in Eq. (6) is calculated in block A by a fast inner product (which exploits the sparseness of c_j). A similar fast inner product is used in the precomputation of the N denominator terms in block B. The denominator on the right-hand side of Eq. (6) is computed once per frame and stored in a memory c. The numerator, on the other hand, is computed for every excitation codevector in the codebook. A codebook search is performed by finding the c_j which maximizes the ratio in Eq. (6). At any point in time, registers E_n and E_d contain the respective numerator and denominator ratio terms corresponding to the best codevector found in the search so far. Products between the contents of the registers E_n and E_d, and the numerator and denominator terms of the current codevector are generated and compared. Assuming the numerator N_1 and denominator D_1 are stored in the respective registers from the previous excitation vector c_{j-1} trial, and the numerator N_2 and denominator D_2 are now present from the current excitation vector c_j trial, the comparison in block 60 is to determine if N_2/D_2 is less than N_1/D_1. Upon cross multiplying the numerators N_1 and N_2 with the denominators D_1 and D_2, we have N_1 D_2 and N_2 D_1. The comparison is then to determine if N_1 D_2 > N_2 D_1. If so, the ratio N_1/D_1 is retained in the registers E_n and E_d. If not, they are updated with N_2 and D_2. This is indicated by a dashed control line labeled N_1 D_2 > N_2 D_1. Each time the control updates the registers, it updates a register E with the index of the current excitation codevector c_j. When all excitation vectors c_j have been tested, the index to be transmitted is present in the register E. That register is cleared at the start of the search for the next vector z_n.
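The register update rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the DSP32 implementation: the function name is hypothetical, the numerators and denominators are assumed to be already computed, and the denominators are assumed positive (they are energies), which is what makes the cross-multiplied test equivalent to comparing the ratios.

```python
def best_index(numerators, denominators):
    """Division-free search for the index that maximizes N_j / D_j.

    Keeps the stored pair (N1, D1) when N1*D2 > N2*D1, otherwise replaces it
    with the current pair (N2, D2). Assumes every denominator is positive.
    """
    best_n, best_d, best_j = numerators[0], denominators[0], 0
    for j in range(1, len(numerators)):
        n2, d2 = numerators[j], denominators[j]
        if not (best_n * d2 > n2 * best_d):
            best_n, best_d, best_j = n2, d2, j
    return best_j
```

Each trial costs two multiplications and one comparison in place of a division.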

This cross-multiplication scheme avoids the division 
operation in Eq. (6), making it more suitable for imple- 
mentation using DSP chips. Also, seven times less mem- 
ory is required since only a few, such as four pulses 
(amplitudes and positions) out of 40 (in the example 
given with reference to FIG. 2) must be stored per 
codevector compared to 40 amplitudes for the case of a 
conventional Gaussian codevector. 

The data compaction scheme for storing the PE codebook and the PE autocorrelation codebook will now be described. One method for storing the codebook is to allocate k memory locations for each codevector, where k is the vector dimension. Then the total memory required to store a codebook of size N is kN locations. An alternative approach which is appropriate for storing sparse codevectors is to encode and store only those samples in each codevector which are nonzero. The zero samples need not be stored as they would have been if the first approach above were used.

In the new technique, each nonzero sample is encoded as an ordered pair of numbers (a, l). The first number a corresponds to the amplitude of the sample in the codevector, and the second number l identifies its location within the vector. The location number is typically an integer between 1 and k, inclusive.

If it is assumed that each location l can be stored using only one-half of a single memory location (as is reasonable since l is typically only a six-bit word), then the total memory required to store a PE codebook is (N_p + N_p/2)N = 1.5 N_p N locations. For a PE codebook with dimension 40, and with N_p=4, a savings factor of about 7 (40 locations versus 6 per codevector) is achieved compared to the first approach just given above. Since the PE autocorrelation codebook is also sparse, the same technique can also be used to efficiently store it.
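The storage scheme and its memory count can be sketched as follows in Python. The function names are hypothetical, locations are 1-based to match the text, and the half-location cost of each position is the assumption stated above.

```python
def encode_sparse(codevector):
    """Keep only nonzero samples as (amplitude, location) pairs, 1-based."""
    return [(a, loc) for loc, a in enumerate(codevector, start=1) if a != 0]


def decode_sparse(pairs, k):
    """Rebuild the full k-sample codevector from its (amplitude, location) pairs."""
    vec = [0.0] * k
    for a, loc in pairs:
        vec[loc - 1] = a
    return vec


def sparse_storage(n_pulses, codebook_size):
    """Memory locations needed: one per amplitude plus half of one per location."""
    return 1.5 * n_pulses * codebook_size
```

For N=256, k=40 and N_p=4 this gives 1536 locations against 10240 for the dense layout, a ratio of roughly 6.7.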

A preferred embodiment of the present invention will now be described with reference to FIG. 5 which illustrates an architecture implemented with a programmable signal processor, such as the AT&T DSP32. The first stage 51 of the encoder (transmitter) is a low-pass filter, and the second stage 52 is a sample-and-hold type of analog-to-digital converter. Both of these stages are implemented with commercially available integrated circuits, but the second stage is controlled by a programmable digital signal processor (DSP).

The third stage 53 is a buffer for storing a block of 160 samples partitioned into vectors of dimension k=40. This buffer is implemented in the memory space of the DSP, which is not shown in the block diagram; only the functions carried out by the DSP are shown. The buffer thus stores a frame of four vectors of dimension 40. In practice, two buffers are preferably provided so that one may receive and store samples while the other is used in coding the vectors in a frame. Such double buffering is conventional in real-time digital signal processing.

The first step in vector encoding after the buffer is filled with one frame of vectors is to perform short-term linear predictive coding (LPC) analysis on the signals in block 54 to extract from a frame of vectors a set of ten parameters {a_i}. These parameters are used to define a filter in block 55 for inverse predictive filtering. The transfer function of this inverse predictive filter is equal to P(z) of Eq. 1. These blocks 54, 55, and 56 correspond to the analysis section 24 of FIG. 2. Together they provide all the preliminary analysis necessary for each successive frame of the input signal s_n to extract all of the parameters {a_i}, {b_j} and P.

The inverse predictive filtering process generates a signal r, which is the residual remaining after removing redundancy from the input signal s. Long-term LPC analysis is then performed on the residual signal r in block 56 to extract a set of four parameters {b_j} and P. The value P represents a quasi-pitch term similar to the one pitch period of speech which ranges from 20 to 147.

A perceptual weighting filter 57 receives the input signal s_n. This filter also receives the set of parameters {a_i} to specify its transfer function W(z) in Eq. 1.

The parameters {a_i}, {b_j} and P are quantized using a table, and coded using the index of the quantized parameters. These indices are transmitted as side information through a multiplexer 67 to a channel 68 that connects the encoder to a receiver in accordance with the architecture described with reference to FIG. 2.

After the LPC analysis has been completed for a frame of four vectors, 40 samples per vector for a total of 160 samples, the encoder is ready to select an appropriate excitation for each of the four speech vectors in the analyzed frame. The first step in the selection process is to find the impulse response h(n) of the cascaded short-term and long-term synthesizers and the weighting filter. That is accomplished in a block 59 labeled “filter characterization,” which is equivalent to defining the filter characteristics (transfer functions) for the filters 25 and 26 shown in FIG. 2. The impulse response h(n) corresponding to the cascaded filters is basically a linear systems characterization of these filters.

Keeping in mind that what has been described thus far is in preparation for doing a codebook search for four successive vectors, one at a time within one frame, the next preparatory step is to compute the Euclidean norm of synthetic vectors in block 60. Basically, the quantities being calculated are the energy of the synthetic vectors that are produced by filtering the PE codevectors from a pulse excitation codebook 63 through the cascaded synthesizers shown in FIG. 2. This is done for all 256 codevectors one time per frame of input speech vectors. These quantities, ||Hc_j||^2, are used for encoding all four speech vectors within one frame. The computation for those quantities is given by the following equation:

$$\|\hat{z}_j\|^2 = \|H c_j\|^2 = R_{hh}(0)\,R_{cc}^{j}(0) + 2\sum_{i=1}^{k-1} R_{hh}(i)\,R_{cc}^{j}(i) \qquad (7)$$

where H is a matrix which contains elements of the impulse response, c_j is one excitation vector, and

$$R_{hh}(i) = \sum_{n=0}^{k-1} h(n)\,h(n+i)$$

$$R_{cc}^{j}(i) = \sum_{n=0}^{k-1} c_j(n)\,c_j(n+i).$$

So, the quantities ||Hc_j||^2 are computed using the values R_cc^j(i), the autocorrelation of c_j. The squared Euclidean norm ||Hc_j||^2 at this point is simply the energy of ẑ_j shown in FIG. 2. Thus, the precomputation in block 60 is effectively to take every excitation vector from the pulse excitation codebook 63, scale it with a gain factor of 1, filter it through the long-term synthesizer, the short-term synthesizer, and the weighting filter, calculate the synthetic speech vector ẑ_j, and then calculate the energy of that vector. This computation is done before doing a pulse excitation codebook search in accordance with Eq. (7).

From this equation it is seen that the energy of each synthetic vector is a sum of products involving the autocorrelation of impulse response R_hh and the autocorrelation of the pulse excitation vector for the particular synthetic vector R_cc^j. The energy is computed for each c_j. The parameter i in the equations for R_cc^j and R_hh indicates the length of shift for each product in a sequence in forming the sum of products. For example, if i=0, there is no shift, and summing the products is equivalent to squaring and accumulating all of the terms within two sequences. If there is a sequence of length 5, i.e., if there are five samples in the sequence, the autocorrelation for i=0 is found by producing another copy of the sequence of samples, multiplying the two sequences of samples, and summing the products. That is indicated in the equation by the summation of products. For i=1, one of the sequences is shifted by one sample, and then the corresponding terms are multiplied and added. The number of samples in a vector is k=40, so i ranges from 0 up to 39 in integers. Consequently, ||Hc_j||^2 is a sum of products between two autocorrelations: one autocorrelation is the autocorrelation of the impulse response, R_hh, and the other is the autocorrelation of the pulse excitation vector R_cc^j. The j symbol indicates that it is the j-th pulse excitation vector. It is more efficient to synthesize vectors at this point and calculate their energies, which are stored in the block 60, than to perform the calculation in the more straightforward way discussed above with reference to FIG. 2. Once these energies are computed for 256 vectors in the codebook 61, the pulse excitation codebook search represented by block 62 may commence, using the predetermined and permanent pulse excitation codebook 63, from which the pulse excitation autocorrelation codebook is derived. In other words, after precomputing (designing) and storing the permanent pulse excitation vectors for the codebook 63, a corresponding set of autocorrelation vectors R_cc are computed and stored in the block 61 for encoding in real time.
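The sum-of-products energy computation can be checked numerically with the short Python sketch below. It assumes h and c_j each hold k samples and that samples beyond index k-1 are treated as zero in the autocorrelations. Under that assumption the formula reproduces the energy of the full, untruncated convolution of h with c_j exactly; when the synthetic vector is instead truncated to k samples it should be read as an approximation. The function names are hypothetical.

```python
def autocorr(x, i):
    """R(i) = sum_n x(n) * x(n + i), with samples past the end taken as zero."""
    return sum(x[n] * x[n + i] for n in range(len(x) - i))


def energy_from_autocorr(h, c):
    """R_hh(0) R_cc(0) + 2 * sum_{i=1}^{k-1} R_hh(i) R_cc(i)."""
    k = len(c)
    return autocorr(h, 0) * autocorr(c, 0) + 2.0 * sum(
        autocorr(h, i) * autocorr(c, i) for i in range(1, k)
    )


def full_convolution(h, c):
    """Untruncated convolution of h and c, used here only as a reference."""
    y = [0.0] * (len(h) + len(c) - 1)
    for n, hn in enumerate(h):
        for m, cm in enumerate(c):
            y[n + m] += hn * cm
    return y
```

Because R_cc^j(i) is zero for most shifts when c_j has only a few pulses, most terms of the sum drop out, which is what makes the precomputation in block 60 inexpensive.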

In order to derive the input vector z_n to the excitation codebook search, the speech input vector s_n from the buffer 53 is first passed through the perceptual weighting filter 57, and the weighted vector is passed through a block 64 the function of which is to remove the effect of the filter memory in the encoder synthesis and weighting filters, i.e., to remove the zero-input response (ZIR) in order to present a vector z_n to the codebook search in block 62.

Before describing how the codebook search is performed, reference should be made to FIG. 3. The bottom part of that figure shows how the precomputation of the energy of the synthetic vector is carried out. Note that there is a correspondence between Eq. (8) and block B in the bottom part of this figure. In accordance with Eq. (8), the autocorrelation of the pulse vector and the autocorrelation of the impulse response are used to compute ||Hc_j||^2, and the results are stored in a memory c of size N, where N is the codebook size. For each pulse excitation vector, there is one energy value stored.

As just noted above with reference to FIG. 5, these quantities R_cc^j can be computed once and stored in memory as well as the pulse excitation vectors of the codebook in block 63 of FIG. 5. That is, these quantities R_cc^j are a function of whatever pulse excitation codebook is designed, so they do not need to be computed on-line. It is thus clear that in this embodiment of the invention, there are actually two codebooks stored in a ROM. One is a pulse excitation codebook in block 63, and the second is the autocorrelation of those codes in block 61. But the impulse response is different for every frame. Consequently, it is necessary to compute Eq. (8) to find N terms and store them in memory c for the duration of the frame.

In selecting an optimal excitation vector, Eq. (6) is used. That is essentially equivalent to the straightforward approach described with reference to FIG. 2, which is to take each excitation, filter it, compute a weighted error vector and its Euclidean norm, and find an optimal excitation. By using Eq. (6), it is possible to calculate for each PE codevector the denominator of Eq. (6). Each ||Hc_j||^2 term is then simply called out of memory as it is needed once it has been computed. It is then necessary to compute on line the numerator of Eq. (6), which is a function of the input speech, because there is a vector z in the equation. The vector v^T, where T denotes a vector transpose operation, at the output of a correlation generator block 65 is equivalent to z^T H. And v is calculated as just a sum of products between the impulse response h_n of the filter and the input vector z_n. So for the v^T, we substitute the following:

$$v(l) = \sum_{n=0}^{k-1} h(n)\,z(n+l) \qquad (9)$$

Consequently, Eq. (6) can be used to select an optimal excitation by calculating the numerator and precalculating the denominator to find the quotient, and then finding which pulse excitation vector maximizes this quotient. The denominator can be calculated once and stored, so all that is necessary is to precompute v, perform a fast inner product between c and v, and then square the result. Instead of doing a division every time as Eq. (6) would require, an equivalent way is to do a cross multiplication as shown in FIG. 3 and described above.

This block diagram of FIG. 5 is actually more de- 
tailed than shown and described with reference to FIG. 
2. The next problem is how to keep track of the index 
and keep track of which of these pulse excitation vec- 
tors is the best. That is indicated in FIG. 5. 

In order to perform the excitation codebook search, what is needed is the pulse excitation code c_j from the codebook 63 itself, and the v vector from block 64. Also needed are the energies of the synthetic vectors precomputed once every frame coming from block 60. Now assuming an appropriate excitation index has been calculated for an input vector s_n, the last step in the process of encoding every excitation is to select a gain factor G_j in block 66. A gain factor G_j has to be selected for every excitation. The excitation codebook search takes into account that this gain can vary. Therefore in the optimization procedure for minimizing the perceptually weighted error, a gain factor is picked which minimizes the distortion. An alternative would be to compute a fixed gain prior to the codebook search, and then use that gain for every excitation vector. A better way is to compute an optimal gain factor G_j for each codevector in the codebook search and then transmit an index of the quantized gain associated with the best codevector c_j. That process is automatically incorporated into Eq. (6). In other words, by maximizing the ratio of Eq. (6), the gain is automatically optimized as well. Thus, what the encoder does in the process of doing the codebook search is to automatically optimize the gain without explicitly calculating it.

The very last step after the index of an optimal excitation codevector is selected is to calculate the optimal gain used in the selection, which is to say compute it from collected data in order to transmit its index from a gain quantizing table. It is a function of z, as shown in the following equation:

$$G_j = \frac{\sum_{n=0}^{k-1} \hat{z}_j(n)\,z(n)}{\sum_{n=0}^{k-1} \hat{z}_j(n)^2} \qquad (10)$$

The gain computation and quantization is carried out in block 66.

From Eq. (10) it is seen that the gain is a function of z(n) and the current synthetic speech vector ẑ_j(n). Consequently, it is possible to derive the gain G_j by calculating the crosscorrelation between the synthetic speech vector ẑ_j and the input vector z_n. This is done after an optimal excitation has been selected. The signal ẑ_j(n) is computed using the impulse response of the encoder synthesis and weighting filters, and the optimal excitation vector c_j. Eq. (10) states that the process is to synthesize a synthetic speech vector using an optimal excitation, calculate the crosscorrelation between original speech and that synthetic vector, and then divide it by the energy in the synthetic speech vector, that is, the sum of the squares of the samples ẑ_j(n). That is the last step in the encoder.
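A minimal Python sketch of this last step follows. The gain is the ratio of Eq. (10); the quantizer simply picks the nearest entry of a gain table, which is an assumption, since the text does not say how the gain quantizing table is searched. The function names and the table values used in any call are hypothetical.

```python
def optimal_gain(z, z_hat):
    """Eq. (10): crosscorrelation of z_hat and z divided by the energy of z_hat."""
    cross = sum(a * b for a, b in zip(z_hat, z))
    energy = sum(a * a for a in z_hat)
    return cross / energy


def quantize_gain(gain, table):
    """Index of the table entry nearest to gain (nearest-neighbor rule assumed)."""
    return min(range(len(table)), key=lambda i: abs(table[i] - gain))
```

Only the index returned by the quantizer needs to be transmitted, together with the codevector index.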

For each frame, the encoder provides (1) a collection of long-term filter parameters {b_j} and P, (2) short-term filter parameters {a_i}, (3) a set of pulse vector excitation indices, each one of length log_2 N bits, and (4) a set of gain factors, with one gain for each of the pulse excitation vector indices. All of this is multiplexed and transmitted over the channel 68. The decoder simply demultiplexes the bit stream it receives.

The decoder shown in FIG. 2 receives the indices, gain factors, and the parameters {a_i}, {b_j}, and P for the speech production synthesizer. Then it simply has to take an index, do a table lookup to get the excitation vector, scale that by the gain factor, pass that through the speech synthesizer filter and then, finally, perform D/A conversion and low-pass filtering to produce the reconstructed speech.

A conventional Gaussian codebook of size 256 cannot be used in VXC without incurring a substantial drop in reconstructed signal quality. At the same time, no algorithms have previously been shown to exist for designing an optimal codebook for VXC-type coders. Designed excitation codebooks are optimal in the sense that the average perceptually-weighted error between the original and synthetic speech signals is minimized. Although convergence of the codebook design procedure cannot be strictly guaranteed, in practice large improvement is gained in the first few iteration steps, and thereafter the algorithm can be halted when a suitable convergence criterion is satisfied. Computer simulations show that both the segmental SNR and perceptual quality of the reconstructed speech increase when an optimized codebook is used (compared to a Gaussian codebook of the same size). An algorithm for designing an optimal codebook will now be described.

The flow chart of FIG. 6 describes how the pulse excitation codebook is designed. The procedure starts in block 1 with a speech training sequence using a very long segment of speech, typically eight minutes. The problem is to analyze that training segment and prepare a pulse excitation codebook.

The training sequence includes a broad class of speakers (male, female, young, old). The more general this training sequence, the more robust the codebook will be in an actual application. Consequently, this training sequence should be long enough to include all manner of speech and accents. The training procedure is an iterative process. It starts with one excitation codebook. For example, it can start with a codebook having Gaussian samples. The technique is to iteratively improve on it, and when the algorithm has converged, the iterative process is terminated. The permanent pulse excitation codebook is then extracted from the output of this iterative algorithm.

The iterative algorithm produces an excitation code- 
book with fully-populated codevectors. The last step 
center clips those codevectors to get the final pulse 
excitation codebook. Center clipping means to elimi- 
nate small samples, i.e., to reduce all the small amplitude 
samples to zero, and keep only the largest, until only the 
largest samples remain in each vector. In summary, 
having a sequence of numbers to construct a pulse exci- 
tation codevector, the final step in the iterative process 
to construct a pulse excitation codebook is to retain out
of k samples the N_p samples of largest amplitude.
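Center clipping as described here amounts to keeping the N_p samples of largest magnitude and zeroing the rest. A minimal Python sketch follows; the function name is hypothetical and the tie-breaking rule (the earlier sample wins when magnitudes are equal) is an assumption, since the text does not address ties.

```python
def center_clip(codevector, n_pulses):
    """Zero every sample except the n_pulses samples of largest magnitude."""
    order = sorted(
        range(len(codevector)), key=lambda n: abs(codevector[n]), reverse=True
    )
    keep = set(order[:n_pulses])      # stable sort: earlier index wins a tie
    return [x if n in keep else 0.0 for n, x in enumerate(codevector)]
```

Applied to every vector of the converged codebook with N_p=4 and k=40, this yields codevectors with four pulses each.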

Design of the PE codebook 63 shown in FIG. 5 will now be described in more detail with reference to FIG. 6. The first step in the iterative technique is to basically encode the training set. Prior to that there has been made available (in block 1) a very long segment of original speech. That long segment of speech is analyzed in block 2 to produce m input vectors z_n from the training sequence. Next the coder of FIG. 5 is used to encode each of these m input vectors. Once the sequence of vectors z_n is available, a clustering operation is performed in block 3. That is done by collecting all of the input vectors z_n which are associated with one particular codevector.

Assuming completion of encoding this whole training sequence, and assuming the first excitation vector is picked as the optimal one for 10 training set vectors, and the second one is selected 20 times, for the case of the first vector, those 10 input vectors are grouped together and associated with the first excitation vector c_1. For the next excitation, all the input vectors which were associated with it are grouped together, and this generates a cluster of z vectors. So for every element in the codebook there is a cluster of z vectors. Once a cluster is formed, a “centroid” is calculated in block 4.
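The grouping step can be sketched in Python as follows. It assumes the encoder has already produced, for each training vector, the index of the codevector chosen for it; the function name and the choice to store training-vector positions rather than the vectors themselves are illustrative assumptions.

```python
def cluster_by_index(chosen, n_codevectors):
    """Group training-vector positions by the codevector index chosen for them.

    chosen[t] is the codebook index selected for training vector t.
    Returns one list of positions per codevector; unused codevectors get [].
    """
    clusters = [[] for _ in range(n_codevectors)]
    for t, j in enumerate(chosen):
        clusters[j].append(t)
    return clusters
```

A codevector that is never selected ends up with an empty cluster, and the text does not say how such a case is handled.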

What “centroid” means will be explained in terms of 
a two-dimensional vector, although a vector in this 
invention may have a dimension of 40 or more. Suppose 
the two-dimensional codevectors are represented by 
two dots in space, with one dot placed at the origin. In 
the space of all two-dimensional vectors, there are N 
codevectors. In encoding the training sequence, the 
input could consist of many input vectors scattered all 
over the space. In a clustering procedure, all of the 
input vectors which are closest to one codevector are 
collected by bringing the various closest vectors to that 
one. Other input vectors are similarly clustered with 
other codevectors. This is the encoding process repre- 
sented by blocks 2 and 3 in FIG. 6. The steps are to 
generate the input vectors and cluster them. 

Next, a centroid is to be calculated for each cluster in 
block 4. A centroid is simply the average of all vectors 
clustered, i.e., it is that vector which will produce the 
smallest average distortion between all these input vec- 
tors and the centroid itself. 

There is some distortion between a given input vector 
and a codevector, and there is some distortion between 
other input vectors and their associated codevector. If 
all the distortions associated with one codevector are 
summed together, a number will be generated repre- 
senting the distortion for that codevector. A centroid 
can be calculated based on these input vectors by deter- 
mining which will do a better job of reconstructing the 
input vectors than the original codevector. If it is the 
centroid, then the summation of the distortions between 
that centroid and the input vectors in the cluster will be 
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minimum . Since this centroid could do a better job of 
representing these vectors than the original codevector, 
it is retained by updating the corresponding excitation 
codebook location in block 5. So this is the codevector 
ultimately retained in the excitation codebook. Thus, in
this step of the codebook design procedure, the original
Gaussian codevector is replaced by the centroid. In that
manner, a new codevector is generated.
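
As a small worked illustration of why the centroid is retained, the sketch below compares the summed distortion of one cluster against its original codevector and against its centroid. It assumes the simple case in which distortion is the squared Euclidean error, so that the centroid is the arithmetic mean; the names and data are hypothetical.

```python
def centroid(vectors):
    """Component-wise average of a list of equal-length vectors."""
    count = len(vectors)
    dims = len(vectors[0])
    return tuple(sum(v[d] for v in vectors) / count for d in range(dims))


def total_distortion(vectors, representative):
    """Sum of squared errors between each vector and one representative."""
    return sum(sum((x - r) ** 2 for x, r in zip(v, representative))
               for v in vectors)


cluster_vectors = [(1.0, 2.0), (3.0, 2.0), (2.0, 5.0)]
original_codevector = (0.0, 0.0)
new_codevector = centroid(cluster_vectors)

before = total_distortion(cluster_vectors, original_codevector)
after = total_distortion(cluster_vectors, new_codevector)
# The centroid never gives a larger summed squared error than the
# codevector it replaces, so the codebook entry is overwritten with it.
```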

For the specific case of VXC, the centroid derivation
is based on the following set of conditions. Starting with
a cluster of M elements, each consisting of a weighted
speech vector z_i, a synthesis filter impulse response
sequence h_i(m), and a speech model gain G_i, denote one
z_i-h_i(m)-G_i triplet as (z_i, h_i, G_i), 1 ≤ i ≤ M. The objective
is to find the centroid vector u for the cluster which
minimizes the average squared error between z_i and
G_i H_i u, where H_i is the lower triangular matrix
described in Eq. (4).

The solution to this problem is similar to a linear
least-squares result:

$$\sum_{i=1}^{M} G_i^2 H_i^T H_i \, u = \sum_{i=1}^{M} G_i H_i^T z_i \tag{11}$$


Eq. (11) states that the optimal u is determined by separately
accumulating a set of matrices and vectors corresponding
to every (z_i, h_i, G_i) in the cluster, and then
solving a standard linear algebra matrix equation
(Ax=b).
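
The accumulate-then-solve procedure of Eq. (11) can be sketched as follows. This is a minimal illustration only: it assumes that H_i is the lower-triangular Toeplitz matrix whose entry in row m, column n is h_i(m - n), which is the usual convolution-matrix form and is taken here as an assumption about Eq. (4). The helper names, the small Gaussian-elimination solver, and the test data are all hypothetical.

```python
def lower_triangular(h, k):
    """k-by-k matrix with H[m][n] = h[m - n] for m >= n (assumed form of Eq. 4)."""
    return [[h[m - n] if (m >= n and m - n < len(h)) else 0.0
             for n in range(k)] for m in range(k)]


def matvec(matrix, x):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def vxc_centroid(triplets, k):
    """Accumulate A = sum G^2 H^T H and b = sum G H^T z, then solve A u = b."""
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for z, h, G in triplets:
        H = lower_triangular(h, k)
        for r in range(k):
            b[r] += G * sum(H[m][r] * z[m] for m in range(k))
            for c in range(k):
                A[r][c] += G * G * sum(H[m][r] * H[m][c] for m in range(k))
    return solve(A, b)


# Build a cluster whose z vectors are exactly G * H * u_true, so the
# centroid recovered from Eq. (11) should reproduce u_true.
k = 3
u_true = [1.0, -2.0, 0.5]
triplets = []
for h, G in [([1.0, 0.5, 0.25], 2.0), ([1.0, -0.3, 0.1], 0.5)]:
    z = [G * value for value in matvec(lower_triangular(h, k), u_true)]
    triplets.append((z, h, G))
u = vxc_centroid(triplets, k)
```

In practice the accumulated matrix is symmetric, so a Cholesky-type solver could replace the general elimination routine used in this sketch.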

Every cluster yields a new centroid in this way, and each
centroid replaces the codevector that formed its cluster,
thus constructing a codebook that is more representative
of the input training set than the
original codebook. This procedure is repeated over and
over, each time using the new codebook to encode the
training sequence, calculate centroids and replace the
codevectors with their corresponding centroids. That is
the basic iterative procedure shown in FIG. 6. The idea
is to calculate a centroid for each of the N codevectors,
where N is the codebook size, then update the excitation
codebook and check to see if convergence has been
reached. If not, the procedure is repeated for all input
vectors of the training sequence, going back either to
block 2 (closed-loop iteration) or to block 3 (open-loop
iteration), until convergence has been achieved.
Then in block 6, the final codebook is center
clipped to produce the pulse excitation codebook. That
is the end of the pulse excitation codebook design
procedure.
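
The overall loop of FIG. 6 can be sketched as below. This is a simplified illustration under stated assumptions, not the coder's actual design procedure: it substitutes plain squared Euclidean error and the arithmetic-mean centroid for the weighted VXC distortion and the Eq. (11) centroid, it does not model the difference between closed-loop and open-loop iteration, and its convergence test (relative drop in total distortion below `tol`) is an assumed criterion. All names and data are hypothetical.

```python
def center_clip(vector, keep):
    """Zero all but the `keep` largest-magnitude samples of `vector`."""
    order = sorted(range(len(vector)), key=lambda n: abs(vector[n]),
                   reverse=True)
    kept = set(order[:keep])
    return [x if n in kept else 0.0 for n, x in enumerate(vector)]


def design_codebook(training, codebook, keep, tol=1e-6, max_iter=100):
    """Iterate encode / centroid / update until convergence, then clip."""
    codebook = [list(c) for c in codebook]
    previous = float("inf")
    for _ in range(max_iter):
        # Encode the training sequence: cluster by nearest codevector.
        clusters = [[] for _ in codebook]
        distortion = 0.0
        for v in training:
            errors = [sum((x - c) ** 2 for x, c in zip(v, cv))
                      for cv in codebook]
            j = errors.index(min(errors))
            clusters[j].append(v)
            distortion += errors[j]
        # Replace each codevector by the centroid of its cluster.
        for j, members in enumerate(clusters):
            if members:
                codebook[j] = [sum(col) / len(members)
                               for col in zip(*members)]
        # Assumed convergence test: distortion has stopped decreasing.
        if previous - distortion <= tol * max(distortion, 1e-12):
            break
        previous = distortion
    # Final step: center clip to obtain the sparse (pulse) codebook.
    return [center_clip(c, keep) for c in codebook]


training = [[1.0, 0.1, 0.0], [0.9, -0.1, 0.0],
            [0.0, 0.2, -2.0], [0.1, 0.0, -2.2]]
initial = [[0.5, 0.5, 0.5], [-0.5, -0.5, -0.5]]
sparse = design_codebook(training, initial, keep=1)
clipped = center_clip([0.2, -3.0, 1.5, 0.1], 2)
```

Omitting the final `center_clip` pass in this sketch corresponds to keeping the fully populated codebook.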

By eliminating the last step, wherein a pulse codebook
is constructed (i.e., by retaining the designed excitation
codebook after the convergence test is satisfied), a
codebook having fully populated codevectors may be
obtained. Computer simulation results have shown that
such a codebook will give superior performance compared
to a Gaussian codebook of the same size.

A vector excitation speech coder has been described
which achieves very high reconstructed speech quality
at low bit-rates, and which requires 800 times less
computation than earlier approaches. Computational savings
are achieved primarily by incorporating fast-search
techniques into the coder and using a smaller, optimized
excitation codebook. The coder also requires less total
codebook memory than previous designs, and is well-structured
for real-time implementation using only one
of today's programmable digital signal processor chips.


The coder will provide high-quality speech coding at 
rates between 4000 and 9600 bits per second. 

What is claimed is: 

1. An improvement in the method for compressing
digitally encoded speech or audio signal by using a
permanent indexed codebook of N predetermined excitation
vectors of dimension k, each having an assigned
codebook index j to find indices which identify the best
match between an input speech vector s_n that is to be
coded and a vector c_j from a codebook, where the subscript
j is an index which uniquely identifies a codevector
in said codebook, and the index of which is to be
associated with the vector code, comprising the steps of

buffering and grouping said vectors into frames of L
samples, with L/k vectors for each frame,

performing initial analyses for each successive frame
to determine a set of parameters for specifying
long-term synthesis filtering, short-term synthesis
filtering, and perceptual weighting,

computing a zero-input response of a long-term synthesis
filter, short-term synthesis filter, and perceptual
weighting filter,

perceptually weighting each input vector s_n of a
frame and subtracting from each input vector s_n
said zero input response to produce a vector z_n,

obtaining each codevector c_j from said codebook one
at a time and processing each codevector c_j
through a scaling unit, said unit being controlled by
a gain factor G_j, and further processing each scaled
codevector c_j through a long-term synthesis filter,
short-term synthesis filter and perceptual
weighting filter in cascade, said cascaded filters
being controlled by said set of parameters to produce
a set of estimates z_j of said vector z_n, one
estimate for each codevector c_j,

finding the estimate z_j which best matches the vector
z_n,

computing a quantized value of said gain factor G_j
using said vector z_n and the estimate z_j which best
matches z_n,

pairing together the index j of the estimate z_j which
best matches z_n and said quantized value of said
gain factor G_j as index-gain pairs for later reconstruction
of said digitally encoded speech or audio
signal,

associating with each frame said index-gain pairs
from said frame along with the quantized values of
said parameters obtained by initial analysis for use in
specifying long-term synthesis filtering and short-term
synthesis filtering in said reconstruction of
said digitally encoded speech or audio signal, and

during said reconstruction, reading out of a codebook
a codevector c_j that is identical to the codevector
c_j used for finding said best estimate by processing
said reconstruction codevector c_j through said scalar
and said cascaded long-term and short-term
synthesis filters.

2. An improvement in the method for compressing 
digitally encoded speech as defined in claim 1 wherein 
said codebooks are made sparse by extracting vectors 
from an initial arbitrary codebook, one at a time, and 
setting all but a selected number of samples of highest 
amplitude values in each vector to zero amplitude val- 
ues, thereby generating a sparse vector with the same 
number of samples as the initial vector, but with only 
said selected number of samples having nonzero values. 

3. An improvement in the method for compressing
digitally encoded speech as defined in claim 1 by use of
a codebook to store vectors c_j, where the subscript j is
an index for each vector stored, a method for designing
an optimum codebook using an initial arbitrary codebook
and a set of m speech training vectors s_n by producing
for each vector s_n in sequence said perceptually
weighted vector z_n, clustering said m vectors z_n, calculating
N centroid vectors from said m clustered vectors,
where N<m, update said codebook by replacing N
vectors c_j with vector s_n used to produce vector z_n
found to be a best match with said vector z_j at index
location j, and testing for convergence between the
updated codebook and said set of m speech training
vectors s_n, and if convergence has not been achieved,
repeating the process using the updated codebook until
convergence is achieved.

4. An improvement as defined in claim 3, including a
final step of center clipping vectors in the last updated
codebook vector by setting to zero all but a selected
number of samples of lowest amplitude in each vector
c_j, and leaving in each vector c_j only said selected number
of samples of highest amplitude by extracting the
vectors of said last updated codebook, one at a time, and
setting all but a selected number of samples of highest
amplitude values in each vector to amplitude values of
zero, thereby generating a sparse vector with the same
number of samples as the last updated vector, but with
only said selected number of samples having nonzero
values.

5. An improvement as defined in claim 1 comprising
a two-step fast search method wherein the first step is to
classify a current speech frame prior to compressing by
selecting one of a plurality of classes to which the current
speech frame belongs, and the second step is to use
a selected one of a plurality of reduced sets of codevectors
to find the best match between each input vector z_n and
one of the codevectors of said selected reduced set of
codevectors having a unique correspondence between
every codevector in the set and particular vectors in
said permanent indexed codebook, whereby a reduced
exhaustive search is achieved for processing each input
vector z_n of a frame by first classifying the frame and
then using a reduced codevector set selected from the
permanent index codebook for every input vector of the
frame.

6. An improvement as defined in claim 5 wherein
classification of each frame is carried out by examining
the spectral envelope parameters of the current frame
and comparing said spectral envelope parameters with
stored vector parameters for all classes in order to select
one of said plurality of reduced sets of codevectors.

7. An improvement as defined in claim 1, wherein the
step of computing said quantized value of said gain
factor G_j and the estimate that best matches z_n is carried
out by calculating the cross-correlation between the
estimate z_j and said vector z_n, and dividing the cross-correlation
product of said vector z_n and said estimate z_j
in accordance with the following equation:

$$G_j = \frac{\sum_{n=0}^{k-1} z(n)\, z_j(n)}{\sum_{n=0}^{k-1} \left[ z_j(n) \right]^2}$$

where k is the number of samples in a vector.

8. An improvement in the method for compressing
digitally encoded speech or audio signal by using a
permanent indexed codebook of N predetermined excitation
vectors of dimension k, each having an assigned
codebook index j to find indices which identify the best
match between an input speech vector s_n that is to be
coded and a vector c_j from a codebook, where the subscript
j is an index which uniquely identifies a codevector
in said codebook, and the index of which is to be
associated with the vector code, comprising the steps of

designing said codebook to have sparse vectors by
extracting vectors from an initial arbitrary codebook,
one at a time, and setting to zero value all but
a selected number of samples of highest amplitude
values in each vector, thereby generating a sparse
vector with the same number of samples as the
initial vector, but with only said selected number of
samples having nonzero values,

buffering and grouping said vectors into frames of L
samples, with L/k vectors for each frame,

performing initial analyses for each successive frame
to determine a set of parameters for specifying
long-term synthesis filtering, short-term synthesis
filtering, and perceptual weighting,

computing a zero-input response of a long-term synthesis
filter, short-term synthesis filter, and perceptual
weighting filter,

perceptually weighting each input vector s_n of a
frame and subtracting from each input vector s_n
said zero input response to produce a vector z_n,

obtaining each codevector c_j from said codebook one
at a time and processing each codevector c_j
through a scaling unit, said unit being controlled by
a gain factor G_j, and further processing each scaled
codevector c_j through a long-term synthesis filter,
short-term synthesis filter, said cascaded filters
being controlled by said set of parameters to produce
a set of estimates z_j of said vector z_n, one
estimate for each codevector c_j,

finding the estimate z_j which best matches the vector
z_n,

computing a quantized value of said gain factor G_j
using said vector z_n and the estimate z_j which best
matches z_n,

pairing together the index j of the estimate z_j which
best matches z_n and said quantized value of said
gain factor G_j for later reconstruction of said digitally
encoded speech or audio signal,

associating with each frame said index-gain pairs
from said frame along with the quantized values of
said parameters obtained by initial analysis for use
in specifying long-term synthesis filtering and
short-term synthesis filtering in said reconstruction
of said digitally encoded speech or audio signal,
and

during said reconstruction, reading out of a codebook
a codevector c_j that is identical to the codevector c_j used
for finding said best estimate by processing said
reconstruction codevector c_j through said scalar
and said cascaded long-term and short-term synthesis
filters.

9. An improvement in the method for compressing
digitally encoded speech as defined in claim 8 by use of
a codebook to store vectors c_j, where the subscript j is
an index for each vector stored, a method for designing
an optimum codebook using an initial arbitrary codebook
and a set of m speech training vectors s_n by producing
for each vector s_n in sequence said perceptually
weighted vector z_n, clustering said m vectors z_n, calculating
N centroid vectors from said m clustered vectors,
where N<m, update said codebook by replacing N
vectors c_j with vector s_n used to produce vector z_n
found to be a best match with said vector z_j at index
location j, and testing for convergence between the
updated codebook and said set of m speech training
vectors s_n, and if convergence has not been achieved,
repeating the process using the updated codebook until
convergence is achieved.

10. An improvement as defined in claim 9, including
a final step of extracting the last updated vectors, one at a
time, and setting to zero value all but a selected number
of samples of highest amplitude values in each vector,
thereby generating a sparse vector with the same number
of samples as the last updated vector, but with only
said selected number of samples with nonzero values.

11. An improvement as defined in claim 8 comprising
a fast search method using said codebook to select a
number N_c of good excitation vectors c_j, where N_c is
much smaller than N, and using said vectors N_c for an
exhaustive search to find the best match between said
vector z_n and estimate vector z_j produced from a codevector
c_j included in said N_c codebook vectors by precomputing
N vectors z_j, comparing an input vector z_n
with vectors z_j, and producing a codebook of N_c codevectors
for use in an exhaustive search of the best match
between said input vector z_n and a vector z_j from a
codebook of N_c vectors.

12. An improvement as defined in claim 11 wherein
said N_c codebook is produced by making rough classification
of the gain-normalized spectral shape of a current
speech frame into one of M_s spectral shape classes, and
selecting one of M_s shaped codebooks for encoding an
input vector z_n by comparing said input vector with the
z_j vectors stored in the selected one of the M_s shaped
codebooks, and then taking the N_c codevectors which
produce the N_c smallest errors for use in said N_c codebook.

***** 